Generalizability of Abusive Language Detection Models on Homogeneous German Datasets
Authors
Abstract
Abusive language detection has become an integral part of research, as reflected in numerous publications and several shared tasks conducted in recent years. It has been shown that the obtained models perform well on the datasets on which they were trained, but have difficulty generalizing to other datasets. This work also focuses on model generalization, but in contrast to previous work we use homogeneous datasets for our experiments, assuming a higher generalizability. We want to find out how well models trained on similar datasets generalize and whether generalizability depends on the method used to obtain the model. To this end, we selected four German datasets from popular shared tasks, three of which are consecutive GermEval tasks. Furthermore, we evaluate two deep learning methods as well as traditional machine learning methods and derive trends based on the results. Our experiments show that generalization is only partially given, although the annotation schemes of these datasets are almost identical. Our findings additionally show that the results depend solely on the (combinations of) training sets and are consistent no matter what the underlying method is.
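The experimental setup described in the abstract, training on one dataset and testing on the others, can be illustrated with a minimal Python sketch. Everything below is a hypothetical stand-in rather than the authors' pipeline: the file names, column names, and the TF-IDF plus logistic regression baseline (standing in for the traditional machine learning methods) are assumptions made for illustration.

# Hypothetical cross-dataset generalization sketch, not the authors' pipeline.
# Assumed: each CSV has columns "text" and "label"; file names are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

DATASETS = ["germeval_a.csv", "germeval_b.csv", "germeval_c.csv", "fourth_dataset.csv"]

def load(path):
    df = pd.read_csv(path)
    return df["text"], df["label"]

for train_path in DATASETS:
    X_train, y_train = load(train_path)
    # Simple baseline: word/bigram TF-IDF features fed into logistic regression.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    for test_path in DATASETS:
        if test_path == train_path:
            continue  # in-dataset performance is not the point here
        X_test, y_test = load(test_path)
        macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
        print(f"train={train_path} test={test_path} macro-F1={macro_f1:.3f}")

The quantity of interest is the drop from in-dataset to cross-dataset macro-F1; according to the abstract, this drop persists even though the annotation schemes of the datasets are almost identical.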
Similar Resources
Abusive Language Detection on Arabic Social Media
In this paper, we present our work on detecting abusive language on Arabic social media. We extract a list of obscene words and hashtags using common patterns used in offensive and rude communications. We also classify Twitter users according to whether they use any of these words or not in their tweets. We expand the list of obscene words using this classification, and we report results on a n...
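As a rough illustration of the lexicon-based idea described in this abstract (not the authors' implementation), one can flag tweets that contain any word from a seed list of obscene terms and then mark the users who posted them; the seed lexicon and the sample tweets below are placeholders.

# Hypothetical lexicon-based flagging sketch; lexicon and data are placeholders.
import re

SEED_OBSCENE_WORDS = {"badword1", "badword2"}  # placeholder seed lexicon

def is_abusive(text, lexicon=SEED_OBSCENE_WORDS):
    # Tokenize crudely and check for any lexicon hit.
    tokens = re.findall(r"\w+", text.lower())
    return any(token in lexicon for token in tokens)

def classify_users(tweets):
    # tweets: iterable of (user_id, text) pairs; a user is flagged
    # as soon as one of their tweets matches the lexicon.
    flagged = {}
    for user_id, text in tweets:
        flagged[user_id] = flagged.get(user_id, False) or is_abusive(text)
    return flagged

sample = [("u1", "some badword1 here"), ("u2", "a harmless tweet")]
print(classify_users(sample))  # {'u1': True, 'u2': False}

Users flagged this way can then be used to expand the seed lexicon with further obscene terms, which is the bootstrapping step the abstract alludes to.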
Abusive Language Detection in Online User Content
Detection of abusive language in user generated online content has become an issue of increasing importance in recent years. Most current commercial methods make use of blacklists and regular expressions, however these measures fall short when contending with more subtle, less ham-fisted examples of hate speech. In this work, we develop a machine learning based method to detect hate speech on o...
On generalizability of MOOC models
The big data imposes the key problem of generalizability of the results. In the present contribution, we discuss statistical tools which can help to select variables adequate for target level of abstraction. We show that a model considered as over-fitted in one context can be accurate in another. We illustrate this notion with an example analysis experiment on the data from 13 university Massiv...
Dimensions of Abusive Language on Twitter
In this paper, we use a new categorical form of multidimensional register analysis to identify the main dimensions of functional linguistic variation in a corpus of abusive language, consisting of racist and sexist Tweets. By analysing the use of a wide variety of parts-ofspeech and grammatical constructions, as well as various features related to Twitter and computer-mediated communication, we...
Understanding Abuse: A Typology of Abusive Language Detection Subtasks
As the body of research on abusive language detection and analysis grows, there is a need for critical consideration of the relationships between different subtasks that have been grouped under this label. Based on work on hate speech, cyberbullying, and online abuse we propose a typology that captures central similarities and differences between subtasks and we discuss its implications for dat...
Journal
Journal title: Datenbank-Spektrum
Year: 2023
ISSN: 1618-2162, 1610-1995
DOI: https://doi.org/10.1007/s13222-023-00438-1